import TroubleshootingLLMConnectivity from './common/troubleshooting-llm-connectivity.mdx';

# Choose a model

Choose one of the following models, obtain its API key, and complete the configuration to get started. If you are a beginner, pick the model that is easiest for you to obtain.

## Adapted models for using Midscene.js

Midscene.js supports two types of models: visual-language models (VL models) and multimodal LLMs.

### Visual-Language models (VL models, ✨ recommended)

Midscene can use visual-language models (VL models) that precisely locate the coordinates of target elements on the page without the help of DOM information.

We recommend these VL models for UI automation because they already work well in most scenarios and are cheaper than multimodal LLMs. In addition, with these models Midscene can drive not only web automation but also Android, iOS, and any other interface. This makes UI automation more intuitive and efficient.

These VL models are already adapted for Midscene.js:

* [Qwen VL](#qwen3-vl-or-qwen-25-vl)
* [Doubao visual-language models](#doubao-vision)
* [`Gemini-2.5-Pro`](#gemini-25-pro)
* [`UI-TARS`](#ui-tars)

### LLM models (will be removed in next major version)

Multimodal LLMs are models that can understand both text and image input.

Multimodal LLMs can only be used for web automation. The typical model of this kind is [GPT-4o](#gpt-4o). You can use [other LLM models](#other-llm-models) if you want.

## Models in depth

<div id="qwen3-vl-or-qwen-25-vl"></div>

### Qwen VL (✨ recommended)

Qwen-VL is an open-source model series released by Alibaba. It offers visual grounding and can accurately return the coordinates of target elements on a page. The models show strong performance for interaction, assertion, and querying tasks. Deployed versions are available on [Alibaba Cloud](https://help.aliyun.com/zh/model-studio/vision) and [OpenRouter](https://openrouter.ai/qwen).

Midscene.js supports the following versions:
* Qwen3-VL series, including `qwen3-vl-plus` (commercial) and `qwen3-vl-235b-a22b-instruct` (open source)
* Qwen2.5-VL series

We recommend the Qwen3-VL series, which clearly outperforms Qwen2.5-VL. Qwen3-VL requires Midscene v0.29.3 or later.

**Config for Qwen3-VL**

Using the Alibaba Cloud `qwen3-vl-plus` model as an example:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen3-vl-plus"
MIDSCENE_USE_QWEN3_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN_VL
```

**Config for Qwen2.5-VL**

Using the Alibaba Cloud `qwen-vl-max-latest` model as an example:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN3_VL
```
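
Because the two Qwen flags cannot be set together, a small pre-flight check can catch the conflict before a run starts. The following is a minimal sketch and not part of the Midscene.js API: the `Env` type and the `isSet` and `checkQwenFlags` helpers are hypothetical names, and treating any non-empty value as "set" is an assumption.

```typescript
// Hypothetical pre-flight check; not part of the Midscene.js API.
type Env = Record<string, string | undefined>;

// Assumption: a flag counts as "set" when it has any non-empty value.
function isSet(env: Env, name: string): boolean {
  const value = env[name];
  return value !== undefined && value.trim() !== "";
}

// Returns an error message when both Qwen flags are set, otherwise null.
function checkQwenFlags(env: Env): string | null {
  const qwen3 = isSet(env, "MIDSCENE_USE_QWEN3_VL");
  const qwen25 = isSet(env, "MIDSCENE_USE_QWEN_VL");
  if (qwen3 && qwen25) {
    return "MIDSCENE_USE_QWEN3_VL and MIDSCENE_USE_QWEN_VL cannot be set together";
  }
  return null;
}
```

In a Node.js script you could call `checkQwenFlags(process.env)` and stop early if it returns a message.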

**Links**
- [Alibaba Cloud - Qwen-VL series](https://help.aliyun.com/zh/model-studio/vision)
- [Qwen on 🤗 HuggingFace](https://huggingface.co/Qwen)
- [Qwen on GitHub](https://github.com/QwenLM/)
- [Qwen on openrouter.ai](https://openrouter.ai/qwen)


<div id="doubao-vision"></div>

### Doubao visual-language models

Volcano Engine provides multiple visual-language models, including:
* `Doubao-seed-1.6-vision` (newer and better)
* `Doubao-1.5-thinking-vision-pro`

They perform strongly for visual grounding and assertion in complex scenarios. With clear instructions they can handle most business needs.

**Config**

After obtaining an API key from [Volcano Engine](https://volcengine.com), you can use the following configuration:

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_DOUBAO_VISION=1
```

**Links**
- [Volcano Engine - Doubao-1.5-thinking-vision-pro](https://www.volcengine.com/docs/82379/1536428)
- [Volcano Engine - Doubao-Seed-1.6-Vision](https://www.volcengine.com/docs/82379/1799865)

<div id="gemini-25-pro"></div>

### `Gemini-2.5-Pro`

Starting from version 0.15.1, Midscene.js supports the Gemini-2.5-Pro model. Gemini 2.5 Pro is a proprietary model provided by Google Cloud.

When using Gemini-2.5-Pro, set `MIDSCENE_USE_GEMINI=1` to enable Gemini-specific behavior.

**Config**

After obtaining an API key from [Google Gemini](https://gemini.google.com/), you can use the following config:

```bash
OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="gemini-2.5-pro-preview-05-06"
MIDSCENE_USE_GEMINI=1
```

**Links**
- [Gemini 2.5 on Google Cloud](https://cloud.google.com/gemini-api/docs/gemini-25-overview)

<div id="ui-tars"></div>

### `UI-TARS`

UI-TARS is an end-to-end GUI agent model based on a VLM architecture. It takes screenshots as input and performs human-like interactions (keyboard, mouse, etc.), achieving state-of-the-art performance across 10+ GUI benchmarks. UI-TARS is open source and available in multiple sizes.

With UI-TARS you can use goal-driven prompts, such as "Log in with username foo and password bar". The model will plan the steps needed to accomplish the task.

**Config**

You can use the deployed `doubao-1.5-ui-tars` on [Volcano Engine](https://volcengine.com).

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-2025..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_VLM_UI_TARS=DOUBAO
```

**Limitations**

- **Weak assertion performance**: It may not perform as well as GPT-4o or Qwen2.5-VL for assertion and query tasks.
- **Unstable action planning**: It may attempt different paths on each run, so the operation path is not deterministic.

**About the `MIDSCENE_USE_VLM_UI_TARS` configuration**

Use `MIDSCENE_USE_VLM_UI_TARS` to specify the UI-TARS version with one of the following values:
- `1.0` - for model version `1.0`
- `1.5` - for model version `1.5`
- `DOUBAO` - for the Volcano Engine deployment
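
If you load this setting from a config file or CI variable, a quick check against the three values listed above can catch typos early. This is only a sketch: `UI_TARS_VALUES` and `isUiTarsValue` are hypothetical names, and the exact, case-sensitive match is an assumption rather than documented Midscene.js behavior.

```typescript
// Hypothetical helper; the accepted values are copied from the list above.
const UI_TARS_VALUES = ["1.0", "1.5", "DOUBAO"] as const;
type UiTarsValue = (typeof UI_TARS_VALUES)[number];

// Type guard: true only for one of the three listed values.
// Assumption: the comparison is exact and case-sensitive.
function isUiTarsValue(value: string | undefined): value is UiTarsValue {
  return value !== undefined && UI_TARS_VALUES.some((allowed) => allowed === value);
}
```

For example, `isUiTarsValue(process.env.MIDSCENE_USE_VLM_UI_TARS)` could guard a Node.js test setup script.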

**Links**
- [UI-TARS on 🤗 HuggingFace](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
- [UI-TARS on GitHub](https://github.com/bytedance/ui-tars)
- [UI-TARS - Model Deployment Guide](https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
- [UI-TARS on Volcengine](https://www.volcengine.com/docs/82379/1536429)

<div id="gpt-4o"></div>

### `GPT-4o`

GPT-4o is a multimodal LLM by OpenAI that supports image input. This is the default model for Midscene.js. When using GPT-4o, step-by-step prompting generally works best.

The token cost of GPT-4o is relatively high because Midscene sends both DOM information and screenshots to the model. GPT-4o can also be unstable in complex scenarios.

**Config**

```bash
OPENAI_API_KEY="......"
OPENAI_BASE_URL="https://custom-endpoint.com/compatible-mode/v1" # Optional, if you want an endpoint other than the default OpenAI one.
MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # Optional. The default is "gpt-4o".
```

<div id="other-llm-models"></div>

## Choose other multimodal LLMs

Midscene.js also supports other models, using the same prompt and strategy as for GPT-4o. If you want to use another model, keep the following in mind:

1. The model must be multimodal, which means it must support image input.
1. Larger models generally work better, but they need more GPU resources or cost more.
1. Find out how to call the model through an OpenAI SDK compatible endpoint. Usually you need to set `OPENAI_BASE_URL`, `OPENAI_API_KEY`, and `MIDSCENE_MODEL_NAME`. The configuration is described in [Config Model and Provider](./model-provider).
1. If results get worse after changing the model, try shorter and clearer prompts, or roll back to the previous model. See [Prompting Tips](./prompting-tips) for more details.
1. Remember to follow the terms of use of each model and provider.
1. Don't set `MIDSCENE_USE_VLM_UI_TARS` or `MIDSCENE_USE_QWEN_VL` unless you know what you are doing.

**Config**

```bash
MIDSCENE_MODEL_NAME="....."
OPENAI_BASE_URL="......"
OPENAI_API_KEY="......"
```
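
The checklist above can be turned into a small configuration review. The sketch below is illustrative only: `reviewLlmConfig` and its result shape are hypothetical, and it assumes that the three variables shown are all needed and that any non-empty value counts as set.

```typescript
// Hypothetical config review; not part of the Midscene.js API.
type Env = Record<string, string | undefined>;

// Variables that are usually needed for an OpenAI-compatible endpoint.
const REQUIRED = ["MIDSCENE_MODEL_NAME", "OPENAI_BASE_URL", "OPENAI_API_KEY"];
// Flags this section advises against setting for generic multimodal LLMs.
const VL_ONLY_FLAGS = ["MIDSCENE_USE_VLM_UI_TARS", "MIDSCENE_USE_QWEN_VL"];

function reviewLlmConfig(env: Env): { missing: string[]; unexpected: string[] } {
  // Assumption: a variable counts as set when it has a non-empty value.
  const hasValue = (name: string): boolean => (env[name] ?? "").trim() !== "";
  return {
    missing: REQUIRED.filter((name) => !hasValue(name)),
    unexpected: VL_ONLY_FLAGS.filter((name) => hasValue(name)),
  };
}
```

A non-empty `missing` list points at variables to fill in, and a non-empty `unexpected` list points at VL flags that were probably left over from another setup.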

For more details and sample config, see [Config Model and Provider](./model-provider).

## FAQ

### How can I check the model's token usage?

Set `DEBUG=midscene:ai:profile:stats` in your environment variables to print the model's usage info and response time.

You can also see your model's usage info in the report file.

### Get an error message "No visual language model (VL model) detected"

Make sure the VL model is configured correctly, and in particular that the matching `MIDSCENE_USE_...` variable is set.
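
When debugging this error, it can help to list which VL flags are actually visible to the process. The sketch below is a hypothetical helper, not a Midscene.js API: `activeVlFlags` is an invented name, the flag list contains only the variables mentioned on this page and may not be exhaustive for your version, and "set" is assumed to mean any non-empty value.

```typescript
// Hypothetical diagnostic helper; not part of the Midscene.js API.
type Env = Record<string, string | undefined>;

// VL flags named on this page; the list may be incomplete for your version.
const VL_FLAGS = [
  "MIDSCENE_USE_QWEN3_VL",
  "MIDSCENE_USE_QWEN_VL",
  "MIDSCENE_USE_DOUBAO_VISION",
  "MIDSCENE_USE_GEMINI",
  "MIDSCENE_USE_VLM_UI_TARS",
];

// Returns the names of the VL flags that have a non-empty value.
function activeVlFlags(env: Env): string[] {
  return VL_FLAGS.filter((name) => (env[name] ?? "").trim() !== "");
}
```

An empty result from `activeVlFlags(process.env)` suggests that no VL flag reached the process, for example because the `.env` file was not loaded.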

## More

* To learn more about the model configuration, see [Config Model and Provider](./model-provider)
* [Prompting Tips](./prompting-tips)

<TroubleshootingLLMConnectivity />
